Okay, let's start.
Obviously winter is coming, if you look around.
Yeah, one more thing.
We talked about Monte Carlo Tree Search, which is a method designed to get around the
difficulty of finding evaluation functions.
Remember we have these search trees.
Those are much too big to explore.
And at the same time, we only get information about how well we're doing at the terminal
nodes.
Okay?
So Minimax kind of rests on the fiction that you explore the whole tree, get the reward at the terminal nodes, and propagate it all the way up.
And that doesn't work because the search trees are too big.
So what we do is, heuristically, we kind of make up values partway down the tree.
How good does a certain state feel?
And of course, just as with heuristics, finding a good one is difficult.
Having a good one gives us a good search.
Having a bad one does the opposite.
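To make that contrast concrete, here is a minimal sketch of depth-limited minimax with such a heuristic evaluation function. The game interface (actions, result, is_terminal, utility) and the evaluate heuristic are placeholder names of my own, not from the lecture.

```python
# Hypothetical game interface: actions(s), result(s, a), is_terminal(s),
# utility(s) at terminal states, and a hand-made heuristic evaluate(s).

def minimax_value(game, state, depth, maximizing):
    """Depth-limited minimax: below the cutoff we substitute the heuristic
    evaluation function for the true (unreachable) terminal reward."""
    if game.is_terminal(state):
        return game.utility(state)      # exact reward, only available at terminal nodes
    if depth == 0:
        return game.evaluate(state)     # heuristic guess: "how good does this state feel?"
    child_values = [
        minimax_value(game, game.result(state, a), depth - 1, not maximizing)
        for a in game.actions(state)
    ]
    return max(child_values) if maximizing else min(child_values)
```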
And so in Monte Carlo Tree Search, instead of inventing a heuristic, which we
call an evaluation function here, we basically sample a couple of these games randomly
and then kind of use the average result, or something like this, as an evaluation function.
Okay, how does that work?
Oh yes, and we briefly talked about AlphaGo, because Monte Carlo Tree Search plus neural
networks that learn to work with the algorithm is actually a good solution; it kind of solved
the game of Go completely out of the blue, a game which many had considered unsolvable before.
Okay, so the idea is that we're going to sample, and what we're doing is building a Monte
Carlo tree. Rather than going through the algorithm description, let's look at an example.
So we start with a root node.
The new thing, of course, is that with every node we associate a Monte Carlo
Tree Search record, which keeps track of how often we've expanded and what the average reward
is for each of the applicable actions.
Okay, so that's easy, that's just initializing the tree with the initial node.
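As a rough sketch of what such a per-node record could look like, with field names of my own choosing rather than the lecture's:

```python
from dataclasses import dataclass, field

@dataclass
class ActionStats:
    """Bookkeeping for one applicable action at a node."""
    count: int = 0           # how often this action has been sampled (expanded)
    avg_reward: float = 0.0  # running average of the rewards observed for it

    def update(self, reward):
        # incremental running average: avg += (reward - avg) / count
        self.count += 1
        self.avg_reward += (reward - self.avg_reward) / self.count

@dataclass
class Node:
    """A Monte Carlo tree node: one ActionStats record per applicable action."""
    state: object
    stats: dict = field(default_factory=dict)  # action -> ActionStats
```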
So we decide on one action randomly, and then another one, and another one, and we do that all the
way down until we hit a goal node.
Remember, goal nodes are the ones where you win, lose, or draw.
And say we're getting a reward of 10 there; then we kind of completely wipe this sampled path out again.
Say we've now expanded the middle action one time and got an average reward of 10 there.
That's just bookkeeping for the sampling.
And then we do it again: we decide on the left one, go all the way down, and say
we hit a 70; we remove everything again, do the bookkeeping, and then we sample some more randomly.
And you can see that the games may have different lengths and so on.
And we keep on sampling and sampling, doing the bookkeeping and we acquire basically an
evaluation function.
Something that tells us how good the different actions actually smell.
And we sample some more randomly.
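A hedged sketch of that sampling-and-bookkeeping step, reusing the hypothetical game interface and the ActionStats record from above: each sample picks a first action at the root, plays random moves down to a terminal state, and folds the observed reward into the running average for that action.

```python
import random

def random_playout(game, state):
    """Play uniformly random moves until a terminal (goal) state; return its reward.
    Games may run for different numbers of moves -- the loop simply continues to the end."""
    while not game.is_terminal(state):
        state = game.result(state, random.choice(game.actions(state)))
    return game.utility(state)

def sample_once(game, root):
    """One Monte Carlo sample: pick a first action at the root, roll out randomly
    to the end of the game, then do the bookkeeping for that action."""
    action = random.choice(game.actions(root.state))
    reward = random_playout(game, game.result(root.state, action))
    root.stats.setdefault(action, ActionStats()).update(reward)
    return action, reward
```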
And then at some point we act on one of the actions.
At this point here we've decided: oh, 60 looks better than 55 or even 35.
And so we're going to exploit the sampling information and commit to an action.
And that was it because now we basically repeat this.
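Putting it together, a minimal sketch of that outer loop under the same assumptions: sample a fixed budget of playouts, exploit the statistics by committing to the action with the best average reward, and then repeat from the resulting state.

```python
def mcts_decide(game, state, budget=1000):
    """Sample a budget of random playouts, then exploit the statistics by
    committing to the action with the best average reward."""
    root = Node(state)
    for _ in range(budget):
        sample_once(game, root)
    return max(root.stats, key=lambda a: root.stats[a].avg_reward)

# Hypothetical usage: alternate deciding and acting until the game is over.
# while not game.is_terminal(state):
#     state = game.result(state, mcts_decide(game, state))
```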